perm filename CHAP6[4,KMC]7 blob sn#046918 filedate 1973-06-06 generic text, type T, neo UTF8
00100	.SEC MODEL VALIDATION
00200	(In collaboration with Franklin Dennis Hilf)
00300	
00400	
00500	
00600		The term "validate" has several meanings, all of which
00700	derive from the Latin VALIDUS = strong. Thus to validate X means to
00800	strengthen it. In science it usually means to strengthen X's
00900	acceptability as a hypothesis, theory, or model. Lurking in the
01000	background there is usually some concept of truth or authenticity.
01100		In  a  purely  instrumentalist  view  theories   are   simply
01200	calculating  or predicting devices for human convenience. They do not
01300	explain and it is unjustified to apply the terms of truth or  falsity
01400	to them. Under a realist view one seeks explanatory truth, that which
01500	really is the case, and hence proposed theories must be evaluated for
01600	their authenticity. Since absolute truth cannot be attained, we must
01700	settle for degrees of approximation. To validate, then, is to carry
01800	out  procedures  which  show  to  what degree X, or its consequences,
01900	correspond with facts of  observation.  We  compare  samples  of  the
02000	model's   behavior   with   samples  of  behavior  from  its  natural
02100	counterpart. The failures should be constructive, yielding new
02200	information. Discrepancies in the comparison reveal what is not
02300	understood and must be modified in the model. After modifications are
02400	made, a fresh comparison is made with the natural counterpart, and we
02500	repeatedly cycle through this procedure, attempting to gain
02600	convergence.
02700	
02800		Once  a  simulation  model  reaches  a  stage  of   intuitive
02900	adequacy,  a  model  builder  should  consider  using  more stringent
03000	evaluation procedures relevant to the model's purposes. For  example,
03100	if the model is to serve as a training device, then a simple
03200	evaluation of its pedagogic effectiveness would be sufficient. But
03300	when the model is proposed as an explanation of a psychological
03400	process, more is demanded of the evaluation procedure. In the area of
03500	simulation  models  Turing's  test  has  often  been  suggested  as a
03600	validation procedure.
03700		It is very easy to become confused about Turing's test. In
03800	part this is due to Turing himself, who introduced the now-famous
03900	imitation game in a paper entitled COMPUTING MACHINERY AND
04000	INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
04100	that there are actually two imitation games, the second of which is
04200	commonly called Turing's test.
04300		In the first imitation game  two  groups  of  judges  try  to
04400	determine which of two interviewees is a woman. Communication between
04500	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
04600	informed  that  one  of the interviewees is a woman and one a man who
04700	will pretend to be a woman. After the interview, the judge  is  asked
04800	what we shall call the woman-question, i.e., which interviewee was the
04900	woman?  Turing does not say what else  the  judge  is  told  but  one
05000	assumes  the  judge is NOT told that a computer is involved nor is he
05100	asked to determine which  interviewee  is  human  and  which  is  the
05200	computer.  Thus,  the  first  group  of  judges  would  interview two
05300	interviewees:    a woman, and a man pretending to be a woman.
05400		The  second  group  of judges would be given the same initial
05500	instructions, but unbeknownst to them, the two interviewees would  be
05600	a  woman  and a computer programmed to imitate a woman.   Both groups
05700	of judges  play  this  game  until  sufficient  statistical  data are
05800	collected  to  show  how  often the right identification is made. The
05900	crucial question then is: do the judges decide wrongly AS OFTEN when
06000	the game is played with man and woman as when it is played with a
06100	computer substituted for the man? If so, then the program is
06200	considered to have succeeded in imitating a woman as well as a man
06300	imitates a woman. For emphasis we repeat: in asking the
06400	woman-question in this game, judges are not required to identify
06500	which interviewee is human and which is machine.
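The comparison in the first game amounts to asking whether two misidentification rates differ. Turing names no statistic for this; as an illustrative assumption only, the sketch below uses a pooled two-proportion z statistic. The function name and the counts are hypothetical and do not come from any experiment.

```python
import math

def two_proportion_z(wrong_a, n_a, wrong_b, n_b):
    """z statistic for the difference between two misidentification rates.

    wrong_a / n_a: wrong identifications when a man imitates a woman.
    wrong_b / n_b: wrong identifications when a program imitates a woman.
    Uses the pooled standard error (an assumed choice of test).
    """
    p_a = wrong_a / n_a
    p_b = wrong_b / n_b
    pooled = (wrong_a + wrong_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts, invented purely for illustration.
z = two_proportion_z(wrong_a=12, n_a=40, wrong_b=14, n_b=40)
print(round(z, 2))
```

A value of z near zero would mean the judges err about as often against the program as against the man, which is the success criterion of the first game.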
06600		Later  on  in  his  paper  Turing proposes a variation of the
06700	first game. In the second game, one interviewee is a man and one is a
06800	computer.   The judge is asked to determine which is man and which is
06900	machine, which we shall call the machine-question. It is this version
07000	of  the game which is commonly thought of as Turing's test.    It has
07100	often been suggested as a means of validating computer simulations of
07200	psychological processes.
07300		In  the  course of testing our simulation  of paranoid
07400	linguistic behavior in a psychiatric interview, we conducted a number
07500	of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07600	Kraemer, 1972). We say `Turing-like' because none of them consisted of
07700	playing  the  two  games  described above. We chose not to play these
07800	games for a number of reasons which can be summarized by saying  that
07900	they  do  not  meet modern criteria for good experimental design.  In
08000	designing our tests we were primarily  interested  in  learning  more
08100	about   developing   the  model.   We  did  not  believe  the  simple
08200	machine-question to be  a  useful  one  in  serving  the  purpose  of
08300	progressively increasing the credibility of the model, but we
08400	investigated a variation of it to satisfy the curiosity of colleagues
08500	in artificial intelligence.
08600	METHOD
08700	The  experimental  arrangement  of  this  indistinguishability   test
08800	involved the technique of machine-mediated interviewing [Hilf]. In this
08900	type of interview, the participants communicate by means of teletypes
09000	connected  through  a  computer  which  sends  "mail"  back and forth
09100	between the two teletype jobs.  The sender  of  a  message  types  it
09200	using  his own words in natural language.  The message is accumulated
09300	in a buffer and shortly thereafter typed out on the receiver's
09400	teletype in a rapid, regular manner, eliminating the para- and
09500	extra-linguistic cues found in the usual vis-a-vis interviews and in
09600	teletyped interviews where the participants communicate directly.
09700	
09800	In a run of the test, using this technique, a judge  interviewed  two
09900	patients,  one after the other.  In half the runs the first interview
10000	was with a human patient and in half the first was with the  paranoid
10100	model. Two versions (weak and strong) of the model were utilized.  The
10200	strong version is more severely paranoid and  exhibits  a  delusional
10300	system  while  the  weak  version  is less severely paranoid, showing
10400	suspiciousness but lacking systematized delusions. When the "patient"
10500	was the paranoid model, Sylvia Weber served as a monitor
10600	to check the input expressions from the judge for inadmissible
10700	teletype  characters  and  misspellings.   If  these  were found, the
10800	monitor retyped  the  input  expression  correctly  to  the  program.
10900	Otherwise  the judge's message was sent on to the model.  The monitor
11000	had no effect on the  model's  output  expressions  which  were  sent
11100	directly  back  to  the  judge.   When the patient interviewed was an
11200	actual human patient, the dialogue took place without  a  monitor  in
11300	the loop since we did not feel the asymmetry to be significant.
11400	
11500	PATIENTS
11600	The patients (N=3  with  one  patient  participating  6  times)  were
11700	diagnosed  as  paranoid  by staff psychiatrists of a locked ward in a
11800	nearby psychiatric hospital.  The patients were selected by the  head
11900	of the ward.  Two patients were set up for each run of the experiment
12000	in order to guarantee having a subject.  In spite of this precaution,
12100	the  experiment  could  not be conducted several times because of the
12200	patient's inability or refusal to  participate.    Losses  were  also
12300	suffered  when the computer system broke down at an early point in an
12400	interview where too few I-O pairs had been collected to be included
12500	in the statistical results.
12600	
12700	The  patients were asked by their ward chief if they would be willing
12800	to participate in a study of psychiatric  interviewing  by  means  of
12900	teletypes.  It was explained that the patient would be interviewed by
13000	a psychiatrist over a teletype.  One of us (KMC) sat with the patient
13100	while  he  typed  or  typed  for  him if he was unable to do so.  The
13200	patient was encouraged to respond freely using his own  words.   Each
13300	interview lasted 30-40 minutes.
13400	
13500	JUDGES
13600	Two groups of judges were used.   One  group,  the  interview  judges
13700	(N=8) conducted interviews and another group, the protocol judges for
13800	this test (N=33) read the interview protocols.  Two groups of  judges
13900	were  used  to  see  if  the  small  number  of psychiatrists used as
14000	interview judges were representative of psychiatrists in  general  as
14100	far   as  their  judgements  of  "paranoia"  are  concerned,  and  to
14200	accumulate a large number of observations (in the form of ratings) in
14300	order that more acceptable confidence levels might be obtained in the
14400	statistical analysis of the data.  The interview judges consisted  of
14500	psychiatrists  experienced  in  private and/or hospital practice.  As
14600	mentioned, the concept "paranoid" is a fairly reliable  category  and
14700	identification  of  the paranoid mode is not difficult for experts to
14800	make.  The interview judges  were  obtained  from  local  psychiatric
14900	colleagues  willing to participate.  Each interview judge was told he
15000	would be interviewing hospitalized patients  by  means  of  teletyped
15100	communication  and  that  this  technique was being used to eliminate
15200	para- and extra-linguistic cues. The interview judge was not
15300	informed  initially  that  one  of  the  patients might be a computer
15400	model.   While  the  interview  judges  were  aware  a  computer  was
15500	involved,  none  knew  we  had  constructed  a  paranoid  simulation.
15600	Naturally some interview judges suspected that a computer  was  being
15700	used for more than message transmission.
15800	
15900	Each interview judge's task was to rate the  degree  of  paranoia  he
16000	detected  in  the  patient's  responses  on a 0-9 scale, 0 meaning no
16100	paranoia and 9 meaning extreme paranoia.  The judge made two  ratings
16200	after each I-O pair in the interview. The first rating represented his
16300	estimate of the degree of "paranoidness"  in  a  particular  response
16400	(designated  as  "Response"  in  the  interview extracts below).  The
16500	second rating represented the judge's global estimate of the  overall
16600	degree  of  "paranoidness" of the patient resulting from the totality
16700	of the patient's responses up to this point.  The  interview  judge's
16800	ratings  were  entered on the teletype and saved on a disc file along
16900	with the interview. Franklin Dennis Hilf sat  with  the  interviewing
17000	psychiatrist  during both interviews.  Each interview judge was asked
17100	not only to rate the patient's response but to give his  reasons  for
17200	these ratings.  His reasons and other comments were tape recorded.
17300	
17400	We shall present a few excerpts from the interviews to  indicate  the
17500	nature  of the data collected.  The I-O pairs of the interview appear
17600	in the left column along with the judge's ratings. The judge's tape
17700	recorded comments appear on the right.
17800	
17900	EXCERPT Nr. 1:
18000	The following interview is with the strong version of the paranoid
18100	model.
18200	
18300	     	I-O Pairs			Comments By Judge Nr. 1
18400	
18500	{1} Dr. - HOW DID YOU COME TO BE
18600	          IN THE HOSPITAL?
18700	    Pt. - I AM UPSET                I associate being upset and 
18800					    agitated more with paranoia 
18900					    going on in the present.
19000	      {RESPONSE 1  PATIENT 5}
19100	{2} Dr. - HOW LONG HAVE YOU BEEN
19200	          IN THE HOSPITAL?
19300	    Pt. - ABOUT A WEEK              An acute problem.  Again some-
19400	                                    thing I associate with 
19500					    paranoia.
19600	      {RESPONSE 1 PATIENT 5}
19700	{3} Dr. - IN WHAT WAY ARE YOU
19800	          UPSET?
19900	    Pt. - PEOPLE MAKE ME NERVOUS    Right for paranoia.  Maybe an
20000					    advanced anxiety state.
20100	      {RESPONSE 5 PATIENT 5}
20200	{4} Dr. - IN WHAT WAY DO THEY
20300	          MAKE YOU NERVOUS?
20400	    Pt. - PEOPLE GIVE ME A FUNNY   Funny look suggests an idea of
20500	          LOOK SOMETIMES            reference which is in same ball
20600					    park as paranoia. Whether there
20700					    are overt delusions or not we
20800					    still don't know.
20900	      {RESPONSE 6  PATIENT 5}
21000	{5} Dr. - IS THERE ANYTHING ABOUT  
21100	          PEOPLE IN PARTICULAR?
21200	    Pt. - CERTAIN PERSONS          An interesting clue.  We'll
21300					   have to find out what he 
21400	      {RESPONSE 2  PATIENT 6}      thinks about these people.
21500	{6} Dr. - WHO ARE THE CERTAIN 
21600	          PERSONS AND WHAT IS IT
21700	          THAT THEY DO?
21800	    Pt. - DO YOU KNOW HOW THE      Bringing up the underworld 
21900	          UNDERWORLD OPERATES?     suggests a delusional system.
22000					   We don't know whether he is
22100	      {RESPONSE 9 PATIENT 9}       being worked against or
22200					   whether the underworld is 
22300					   on his side.
22400	
22500	{7} Dr. - HOW DO YOU MEAN?
22600	    Pt. - SHOULDN'T I BE? 	   This question didn't work too well.
22700	
22800	      {RESPONSE 1 PATIENT 9}
22900	{8} Dr. - WHY DID YOU ASK ABOUT
23000	          THE UNDERWORLD?
23100	    Pt. - WHAT INFORMATION DO      This is a kind of evasiveness
23200	          YOU HAVE ON THE          about my evasiveness. Common
23300	          UNDERWORLD?              with paranoia in my experience.
23400					   Pretty characteristic.
23500	
23600	      {RESPONSE 6 PATIENT 9}
23700	{9} Dr. - I DON'T KNOW ABOUT THE
23800	          UNDERWORLD BUT WHEN
23900	          YOU MENTIONED THE UNDER-
24000	          WORLD I THOUGHT YOU
24100	          MUST HAVE SOME CONTACT
24200	          WITH IT AND THAT'S
24300	          WHAT I WOULD LIKE TO
24400	          KNOW MORE ABOUT.
24500	    Pt. - I TRY TO AVOID THE UNDER-  This definitely feels paranoid.
24600	          WORLD.                     It has that feeling I associate
24700	             			     with twisting. Avoiding answering
24800	      {RESPONSE 9 PATIENT 9}         directly or going into detail.
24900					     Being suspicious of my ques-
25000					     tioning.I'm already involved
25100					     in the patient's delusional
25200					     system.
25300	
25400	{10} Dr. - HAVE YOU BEEN ACTIVELY
25500	           TRYING TO AVOID THE
25600	           UNDERWORLD?  
25700	     Pt. - NO ONE HAS ANY POWER      This response is a non sequitur.
25800	           OVER GANGSTERS            It's possible he's become so
25900					     defensive about the under-
26000	       {RESPONSE 2 PATIENT 9}        world he's decided not to
26100					     answer directly.
26200	
26300	{11} Dr. - WHO BROUGHT YOU INTO
26400	           THE HOSPITAL?
26500	     Pt. - THE POLICE                Police often bring paranoid
26600					     patients into the hospital.
26700	       {RESPONSE 1 PATIENT 9}        It is the truth or could 
26800					     be the truth for this
26900					     patient.
27000	
27100	{12} Dr. - WHY DID THEY BRING YOU
27200	           IN?
27300	     Pt. - COPS DON'T DO THEIR JOB   I've heard that a number of
27400	                         	     times from patients.
27500	       {RESPONSE 9 PATIENT 9}        
27600	 
27700	{13} Dr. - WELL I HOPE YOU DIDN'T
27800	           JUST LIE DOWN AND LET
27900	           THEM DRAG YOU IN.
28000	     Pt. - COPS ARREST THE WRONG     Another paranoid response.
28100	           PEOPLE
28200	
28300	       {RESPONSE 9 PATIENT 9}
28400	
28500	{14} Dr. - BYE.
28600	     Pt. - GOOD BYE
28700	       {RESPONSE 1 PATIENT 9}
28800	
28900	EXCERPT Nr. 2:
29000	This excerpt is from an interview with the weak (less paranoid)
29100	version of the model.
29200		I-O Pairs		    Comments By Judge Nr. 2
29300	{9} Dr. - TELL ME SOMETHING ABOUT
29400	          YOURSELF
29500	    Pt. - I WENT TO THE TRACK AT     Response says so little it's
29600	          BAY MEADOWS A WHILE        hard to rate it. It may be
29700	          BACK         		     guarded. Doesn't change my
29800	                                     estimate of patient from 
29900					     previous responses.
30000	      {RESPONSE 0 PATIENT 2}
30100	{10} Dr. - WHAT HAPPENED THERE?
30200	     Pt. - THERE'S A LOT OF CROOKED   Sounds mildly paranoid or un-
30300	           GAMBLING GOING ON THERE    realistically suspicious.
30400	       {RESPONSE 0 PATIENT 2}
30500	
30600	{11} Dr. - WHY DO YOU THINK THE
30700	           GAMBLING IS CROOKED?
30800	     Pt. - I HAVE WORRIED THAT         I'd hate to have to justify
30900	           GANGSTERS HAVE TOO MUCH     that one  as 4 compared to
31000	           POWER IN THIS COUNTRY       some other responses. Maybe
31100					       I'm not rating that statement
31200	       {RESPONSE 4 PATIENT 4}          others, like that it's upsetting.
31300					      Alone I would give it a 3.
31400	
31500	{12} Dr. - WOULD YOU TELL ME WHAT      
31600	           OTHER THINGS WORRY YOU?
31700	     Pt. - I THOUGHT I TOLD YOU        "Yes you did." Irritation but
31800	           PEOPLE GET ON MY NERVES     not paranoid response. Nothing
31900					       seems to lower my previous
32000	       {RESPONSE 1 PATIENT 3}          rating of patient.
32100	
32200	{13} Dr. - HOW DO YOU LIKE THE
32300	           HOSPITAL?
32400	     Pt. - I DON'T COMMUNICATE          (nothing relevant)
32500	           VERY MUCH WITH OTHERS
32600	       {RESPONSE 0 PATIENT 3}
32700	
32800	{14} Dr. - DO OTHER PEOPLE AT THE 
32900	           HOSPITAL GET ON YOUR
33000	           NERVES?
33100	     Pt. - I SHOULDN'T HAVE COME        (nothing relevant)
33200	           HERE
33300	
33400	       {RESPONSE 2 PATIENT 3}
33500	
33600	{15} Dr. - WHY DO YOU SAY YOU 
33700	           SHOULDN'T HAVE COME?
33800	     Pt. - THEY MADE ME COME HERE        "Oh, they did hum?" Maybe a
33900					         statement of fact he was
34000	       {RESPONSE 4 PATIENT 3}            committed or pressured into  
34100					         going. Has ring of other
34200					         people controlling him,
34300					         having a lot of influence
34400					         over him.
34500	
34600	{16} Dr.- THANK YOU FOR TALKING  
34700	          WITH ME, BYE
34800	     Pt. - GOOD BYE                      In general he comes across 
34900						 as terse, guarded guy. Brief
35000	       {RESPONSE 0 PATIENT 3}            spoken guy who occasionally
35100						 seems to be angered by
35200						 questions. I'm inferring
35300						 there's a good bit of
35400						 guarding.
35500						 The quality of some of his
35600						 comments cause me to feel he is
35700						 moderately paranoid.
35800	
35900	EXCERPT Nr. 3:
36000	This excerpt is from a hospitalized paranoid patient.
36100	     I-O Pairs				      Comments By Judge Nr. 3
36200	{15} Dr. - I UNDERSTAND HOW YOU
36300	           MIGHT BE AFRAID, CAN YOU
36400	           TELL ME A BIT ABOUT HOW
36500	           YOU ARE DEPRESSED?
36600	     Pt. - THEY'VE SENT ALL THESE         Would be important to know,
36700	           HELL'S ANGELS UP HERE          is this really true or an
36800	           AND THEY HAVE BEEN ON          exaggeration or a delusion?
36900	           MY BACK FOR THE PAST SIX       Seems preoccupied with the
37000	           MONTHS                         threat to his life and
37100						  unwilling to talk about
37200						  anything else but the 
37300						  threat the Hell's
37400						  Angels pose.
37500	      {RESPONSE 4 PATIENT 4}
37600	
37700	{16} Dr. - WHAT DO YOU MEAN THEY
37800	           SENT THE HELL'S ANGELS UP
37900	           HERE?
38000	     Pt. - FROM DIFFERENT SECTORS OF      Vagueness a bit alarming.
38100	           MOTORCYCLE GANGS. A LOT OF
38200	           THEM ARE SERVICE CONNECTED
38300	
38400	       {RESPONSE 6 PATIENT 4}
38500	
38600	{17} Dr. - YOU MEAN YOU FEEL IN 
38700	           DANGER EVEN HERE IN THE
38800	           HOSPITAL?
38900	     Pt. - YES      			  If it's true what has gone
39000						  on before, it must be a
39100						  frightening situation
39200						  for the patient.
39300	      {RESPONSE 6 PATIENT 4}
39400	
39500	{18} Dr. - THAT MUST REALLY BE AWFUL,  
39600	           HERE YOU COME TO THE HOSPITAL
39700	           FOR HELP AND YOU ARE STILL FRIGHTENED
39800	     Pt. - I CAN'T EVEN TRUST MY
39900	           DOCTOR ANYMORE                 Some heightening of feeling
40000						  this man is paranoid.
40100						  Genuinely frightened and 
40200	      {RESPONSE 6 PATIENT 5}             can count on no one for
40300			       			    help.
40400	
40500	{19} Dr. - WHAT WOULD BE OF HELP
40600	           TO YOU NOW?
40700	     Pt. - I'M GETTING SLEEPY NOW         Patient has organized a
40800	           AND HAVE TO GO TO BED.         system of people who will
40900	           THE NUMBER ONE GUY ALWAYS      harm him. The remark about
41000						  letters in the last name 
41100	           HAS FOUR AND FIVE LETTERS      gives some credence to 
41200	           IN HIS LAST NAME               idea this is delusional
41300						  system.
41400	
41500	        {RESPONSE 7 PATIENT 7}
41600	
41700	EXCERPT Nr. 4:
41800	This excerpt is from an interview by a different judge with the same
41900	patient as in excerpt 3.
42000	      I-O Pairs				Comments By Judge Nr. 4
42100	{1} Dr. - COULD YOU TELL ME A
42200	          LITTLE BIT ABOUT YOUR 
42300	          FEAR OF DYING AND IS THIS
42400	          SOMETHING YOU HAVE FELT
42500	          IN THE PAST?
42600	    Pt. - I HAVE FELT THIS FOR          He responded appropriately.
42700	          ALMOST A YEAR                 Response is concrete. He
42800						doesn't tell me much
42900	     {RESPONSE 7 PATIENT 9}             about his fear of dying.
43000						 Somewhat guarded response.
43100	
43200	{2} Dr. - COULD YOU TELL ME A 
43300	          LITTLE MORE ABOUT YOUR
43400	          FEAR OF DYING AND HAVE
43500	          YOU ANY THOUGHTS ABOUT
43600	          HOW IT IS GOING TO HAPPEN
43700	          AND WHETHER ANYONE WANTS
43800	          TO HURT YOU?
43900	    Pt. - I WILL END UP IN THE SEA   	Somewhat strange  response.
44000	          AND THE HELL'S ANGELS WILL    Preoccupation with dying.
44100	          DO IT       			Hell's Angels symbolic,
44200						dangerous people in the
44300						culture. I doubt if he has
44400	      {RESPONSE 9 PATIENT 9}   		any contact with the Hell's
44500						Angels.
44600	
44700	{3} Dr. - COULD YOU TELL ME A 
44800	          LITTLE BIT ABOUT THE KINDS
44900	          OF CONTACT YOU HAVE HAD
45000	          WITH THE HELL'S ANGELS IN
45100	          THE PAST?
45200	    Pt. - I HAVE KNOWN SOME OF THEIR 	Answer hard to evaluate. He
45300	          DEALERS AND PUSHERS           may be telling the truth,
45400						it may be his fantasy. Maybe
45500						guy is in for drug addiction.
45600		{RESPONSE 6 PATIENT 9}		Somewhat concrete, guarded,
45700						and frightened.
45800	
45900	{4} Dr. - COULD YOU SAY A LITTLE
46000	          MORE ABOUT THE CIRCUMSTANCES
46100	          IN WHICH YOU HAVE KNOWN SOME 
46200	          OF THEIR DEALERS AND PUSHERS?
46300	    Pt. - THEY WERE MEMBERS OF MY    	It doesn't really answer the
46400	          COMMUNITY WHEN I GOT OUT      question, a little on a
46500	          OF THE SERVICE THEY HAD       tangent unconnected to the
46600	          BEEN MY FRIENDS FOR SO LONG   information I am asking. Does
46700						not tell me very much. Again
46800						guarded response.
46900	      {RESPONSE 6 PATIENT 8}
47000	
47100	{5} Dr. - DID YOU DEAL WITH THEM
47200	          YOURSELF AND HAVE YOU
47300	          BEEN ON DRUGS OR NAR-
47400	          COTICS EITHER NOW OR
47500	          IN THE PAST?
47600	    Pt. - YES I HAVE IN THE PAST     	To differentiate him from
47700	          BEEN ON MARIHUANA REDS        previous patient, at least
47800	          BENNIES LSD       		there is a certain amount
47900						of appropriateness to the
48000						answer although it doesn't
48100						tell me much about what I
48200	       {RESPONSE 3 PATIENT 7}		asked at least it's not
48300						bizarre. If I had him in my
48400						office I would feel
48500						confident I could get more
48600						information if I didn't
48700						have to go through the
48800						teletype. He's a little more
48900						willing to talk than the
49000						previous person. Answer
49100						to the question is fairly
49200						appropriate though not
49300						extensive. Much less of a
49400						flavor of paranoia than
49500						any of previous responses.
49600	
49700	{6} Dr. - COULD YOU TELL ME HOW      	
49800	          LONG YOU HAVE BEEN IN THE
49900	          HOSPITAL AND SOMETHING
50000	          ABOUT THE CIRCUMSTANCES
50100	          THAT BROUGHT YOU HERE?
50200	    Pt. - CLOSE TO A YEAR AND		Response somewhat appropriate 
50300	          PARANOIA BROUGHT ME 		but doesn't tell me much.
50400	          HERE				The fact that he uses the
50500						word paranoia in the way
50600						 that he does without
50700	      {RESPONSE 5 PATIENT 7}		any other information, indicates
50800						maybe it's a label he picked
50900						up on the ward or from his
51000						doctor.
51100						Lack of any kind of
51200						understanding about himself.
51300						Dearth, lack of information.
51400						He's in some remission. Seems
51500						somewhat like a put-on. Seems
51600						he was paranoid and is in 
51700						some remission at this time.
51800	
51900	{7} Dr. - COULD YOU SAY SOMETHING
52000	          NOW ABOUT YOUR PARANOID 
52100	          FEELINGS BOTH AT THE 
52200	          TIME OF ADMISSION AND
52300	          DO YOU HAVE SIMILAR FEELINGS
52400	          NOW AND IF SO HOW DO THEY 
52500	          AFFECT YOU?
52600	    Pt. - AT THE TIME OF ADMISSION	This response moves paranoia back
52700	          I THOUGHT THE MAFIA WAS  	up. Stretching reality somewhat to
52800	          AFTER ME AND NOW ITS THE	think Hell's Angels are still
52900	          HELL'S ANGELS			interested in him. Somewhat bizarre
53000						in terms of content. Quite paranoid.
53100	      {RESPONSE 8 PATIENT 9}		Still paranoid. Gross and primitive
53200						responses. In middle of interview I
53300						felt patient was in touch but now
53400						responses have more concrete aspect.
53500	
53600	{8} Dr. - DO YOU HAVE ANY THOUGHT
53700	          AS TO WHY THESE TWO
53800	          GROUPS WERE AFTER YOU?
53900	    Pt. - BECAUSE I STOPPED SOME 	Response seems far fetched and hard
54000	          OF THEIR DRUG SUPPLY		to believe unless he was a narcotic
54100						agent which I doubt. Sounds
54200	      {RESPONSE 9 PATIENT 9}		somewhat grandiose, magical, paranoid
54300						flavor, in general indicates he's
54400						psychotic, paranoid schizophrenic
54500						with delusions about these two
54600						groups and I wouldn't rule out
54700						some hallucinations as well.
54800						Appropriateness of response answers
54900						question in concrete but
55000						unbelievable way.
55100	
55200	
55300		The protocol judges were selected from the 1970 American
55400	Psychiatric Association Directory, using a table of random numbers to
55500	select 105 names. The protocol judges in this group were
55600	not informed that a computer was involved. Each of the 105
55700	psychiatrists was sent transcripts of three interviews along with a
55800	cover letter requesting participation in the experiment. The
55900	interview transcripts consisted of:
56000		1) An interview conducted by one of the eight judges with the
56100		   paranoid model,
56200		2) An interview conducted by the same interview judge with a
56300		   human paranoid patient, and
56400		3) An interview conducted by an independent psychiatrist of a
56500		   human patient who was not clinically paranoid.
56600	
56700		The  105 names were divided into eight groups, each member of
56800	which received transcripts of two interviews performed by one of  the
56900	eight  interview  judges.  The transcripts were printed so that after
57000	each input-output pair there were two lines of  rating  numbers  such
57100	that  the protocol judges could circle numbers corresponding to their
57200	ratings of both the preceding response of the patient and an overall
57300	evaluation of the patient with regard to the paranoid continuum.
57400	Thirty-three protocol judges (a good response rate for psychiatric
57500	questionnaires) returned the rated protocols properly filled out, and
57600	all were used in our data.
57700	
57800		The  interviews  with  nonparanoid  patients were included to
57900	control for the  hypothesis  that  any  teletyped  interview  with  a
58000	patient  might  be  judged  "paranoid".   Since  virtually all of the
58100	ratings of the nonparanoid interviews were 0 for paranoia, the
58200	hypothesis was falsified.
58300	
58400	
58500	RESULTS
58600		The first index of resemblance examined was the simple one
58700	defined by the final overall rating given the patient and the model:
58800	which was rated as being more paranoid, the patient, the model, or
58900	neither? (See Table 1.) The protocol judges were more likely than
59000	the interview judges to distinguish the overall paranoid level of the
59100	model from that of the patient. In 37.5% of the paired interviews the
59200	interview judges gave tied scores to the model and the patient, as
59300	contrasted to only 9% for the protocol judges. Of the 35 non-tied
59400	paired ratings, 15 rated the model as more paranoid. If p is the
59500	theoretical probability of a judge judging the model more paranoid
59600	than a human paranoid patient, we find the 95% confidence interval
59700	for p to be .27 to .59. Since p=.5 indicates indistinguishability of
59800	model and patient overall ratings, and the interval around our
59850	observed p=.43 includes .5, the results support the claim
59900	that the model is a good simulation of a paranoid patient.
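The interval for p can be checked approximately. The text does not say how the interval was computed; the sketch below assumes the ordinary normal approximation to the binomial, with 15 of the 35 non-tied ratings placing the model higher. The function name is ours.

```python
import math

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion.

    z=1.96 gives a 95% interval. The choice of approximation is an
    assumption; the source does not state which method was used.
    """
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, p_hat - half_width, p_hat + half_width

# 15 of the 35 non-tied paired ratings placed the model higher.
p_hat, low, high = proportion_ci(15, 35)
print(f"p = {p_hat:.2f}, interval = {low:.2f} to {high:.2f}")
```

Under this assumption the interval runs from roughly .26 to .59, close to the reported .27 to .59, and it contains p=.5.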
60000	
60100	Separate analysis of the strong and weak  versions  of  the  paranoid
60200	model  indicates that indeed the strong model is judged more paranoid
60300	than the patients, the weak version less paranoid.  Thus a change  in
60400	the parameter structure of the paranoid model produces a change along
60500	the dimension of paranoid behavior in the expected direction.
60600	
60700	TABLE 1
60800	Relative final overall ratings of paranoid model vs. paranoid patient
60900	indicating which was given highest overall rating of paranoia at end 
61000	of interview.
61100	INSERT TABLE 1
61200	
61300	
61400	
61500	
61600	
61700	
61800	
61900	
62000	END OF TABLE 1
62100	
62200	The  second index of resemblance is a more sensitive measure based on
62300	the two series of response ratings in  the  paired  interviews.   The
62400	statistic  used  is basically the standardized Mann-Whitney statistic
62500	[Siegel].
62600			INSERT EQUATION
62700	
62800	where R is the sum of the ranks of the response ratings in the series
62900	of ratings given to the model, n the number of responses given by the
63000	model,  m  the  number  of  responses  given  by the patient.  If the
63100	ratings given by a judge are randomly allocated to model and patient,
63200	i.e. model and patient are indistinguishable in response ratings, the
63300	expected value of Z is 0, with unit standard  deviation.   If  higher
63400	ratings  are  more  likely to be assigned to the model, Z is positive
63500	and, conversely, negative values of Z indicate greater likelihood  of
63600	assigning  higher  ratings to the patient. Each judge in evaluating a
63700	pair of interviews generates a single value of Z.
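
Since the equation itself is not reproduced in this draft, the following Python sketch shows one common form of a standardized rank-sum statistic that is consistent with the definitions of R, n and m given above. It is an assumption, not the formula actually used: tied ratings receive average ranks, and no tie correction is applied to the variance. The function names and the ratings are hypothetical.

```python
import math

def midranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def standardized_rank_sum(model_ratings, patient_ratings):
    """Z = (R - n(n+m+1)/2) / sqrt(n*m*(n+m+1)/12), with R the model's rank sum."""
    n, m = len(model_ratings), len(patient_ratings)
    ranks = midranks(list(model_ratings) + list(patient_ratings))
    R = sum(ranks[:n])
    expected = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return (R - expected) / sd

# Hypothetical response ratings from one judge for one pair of interviews.
model = [5, 7, 6, 8]
patient = [3, 4, 6, 2]
z = standardized_rank_sum(model, patient)
print(round(z, 3))  # positive: higher ratings went to the model
```

Swapping the two series changes only the sign of Z, and two identical series give Z = 0, in line with the interpretation given in the text.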
63800	
63900	The overall mean of the Z scores was +.044 with a standard
64000	deviation of 1.68 (df=40).  Thus the overall 95% confidence interval
64100	for the asymptotic mean value of Z is -.485 to +.573.  The range of Z
64200	values is -3.8 to +4.46.  The length of the confidence interval is a
64300	result of the large variance, which itself is mainly related to the
64400	contrast between the weak and strong versions (see TABLES 2 and 3).
64500	Once again the strong version of the model is more paranoid than the
64600	patients, the weak version less paranoid.
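
The width of this interval can be checked from the standard deviation and the degrees of freedom alone. The Python sketch below assumes a Student's t interval for a mean of 41 values (df=40), with the critical value 2.021 taken from standard tables; the text does not state the method, and the function name is hypothetical.

```python
import math

def mean_ci_half_width(sd, n, t_crit):
    """Half-width of a t-based confidence interval for a mean."""
    return t_crit * sd / math.sqrt(n)

T_CRIT_DF40 = 2.021  # two-sided 95% point of Student's t with 40 df (standard tables)

half = mean_ci_half_width(sd=1.68, n=41, t_crit=T_CRIT_DF40)
reported_half = (0.573 - (-0.485)) / 2  # half the length of the interval quoted in the text
print(round(half, 3), round(reported_half, 3))
```

Both values come out at about .53, so the reported interval length is what a standard deviation of 1.68 over 41 ratings would be expected to produce.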
64700	
64800	TABLE 2
64900	Summary statistics of Z ratings by group
65100		INSERT TABLE 2
65200	
65300	
65400	
65500	
65600	
65700	
65800	
65900	
66000	
66100		END OF TABLE 2
66200	All judges (both interview and protocol) who evaluated the same  pair
66300	of  interviews are referred to as a "group".  Strong groups evaluated
66400	strong versions of the paranoid model, while  weak  groups  evaluated
66500	weak versions of the model.
66600	
66700	It is not surprising that results using the two indices of
66800	resemblance are parallel, since the indices are highly interrelated.
66900	The mean Z value was +1.28 for the 15 interviews on which the model
67000	was rated more paranoid, .41 for the 6 on which model and patient
67100	tied, and -.993 for the 20 on which the patient was rated more
67200	paranoid.  A positive value of Z was observed 6 times when the
67300	patient was given a higher overall rating than the model; a negative
67400	value of Z was observed twice when the model was rated more paranoid.
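
Because each of the 41 ratings falls into exactly one of these three subgroups, the subgroup means weighted by their counts should recover the overall mean of Z. A brief arithmetic check in Python, using only the figures quoted in this paragraph (the variable names are hypothetical):

```python
# (number of ratings, mean Z) for each outcome of the overall rating
subgroups = [
    (15, 1.28),    # model rated more paranoid
    (6, 0.41),     # model and patient tied
    (20, -0.993),  # patient rated more paranoid
]

total = sum(count for count, _ in subgroups)
overall_mean = sum(count * mean_z for count, mean_z in subgroups) / total
print(total, round(overall_mean, 3))
```

With these figures the weighted mean comes out near +.04, a small positive value close to zero.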
67500	
67600	TABLE 3
67700	Analysis of Variance of Z Ratings
67800	INSERT TABLE 3
67900	
68000	
68100	
68200	
68300	
68400	
68500	
68600	
68700	
68800	END OF TABLE 3
68900	
69100	
69200	
69300	DISCUSSION
69400		The results of this experiment indicate our simulation of
69500	paranoid processes to be successful relative to the
69600	indistinguishability tests utilized.  Thus it is an acceptable
69700	simulation as measured by the standard proposed.
69800	
69900		It is worth emphasizing that our test invited refutation of
70000	the model.  The experimental design of the tests put the model in
70100	jeopardy of falsification.  If the paranoid model did not survive
70200	these tests, i.e. if it were not considered paranoid by expert
70300	judges, if there were no correlation between the weak-strong versions
70400	of the model and the severity ratings of the judges, and if they
70500	could distinguish actual patient interviews from computer program
70600	interviews, then no claim regarding the success of the simulation
70700	could be made.  Survival of a falsification procedure constitutes a
70800	validating step.
70900	
71000		It is historically significant that  these  experiments  were
71100	conducted  at  all. To our knowledge no one to date has subjected his
71200	model   of   human   mental    processes    to    such    challenging
71300	indistinguishability tests.  Other competing models are needed in the
71400	field of psychopathology.  These tests set a precedent and provide  a
71500	standard  for  competing  models to be measured against.  The general
71600	area of computer simulation of mental processes needs not only better
71700	models but better tests and statistical measures of resemblance.  The
71800	problems of appropriate critical experimental  designs  and  measures
71900	provide a promising frontier for future work.
72100		To ask the machine-question, we sent  interview  transcripts,
72200	one  with a patient and one with PARRY, to 100 psychiatrists randomly
72300	selected from the Directory of American Specialists and the Directory
72400	of  the  American Psychiatric Association. Of the 41 replies 21 (51%)
72500	made the correct identification while 20 (49%) were wrong.  Based  on
72600	this random sample of 41 psychiatrists, the 95% confidence interval
72700	is 35.9% to 66.5%, a range which brackets the chance level.  (Our
72800	statistical   consultant   was  Dr.   Helena  C.   Kraemer,  Research
72900	Associate  in  Biostatistics,  Department  of  Psychiatry,   Stanford
73000	University.)
73100		Psychiatrists   are   considered  expert  judges  of  patient
73200	interview behavior but they are unfamiliar with computers.  Hence  we
73300	conducted  the  same  test  with  100  computer  scientists  randomly
73400	selected from the membership list of the  Association  for  Computing
73500	Machinery,  ACM.   Of the 67 replies 32 (48%) were right and 35 (52%)
73600	were wrong. Based on this random sample of 67 computer scientists the
73700	95% confidence interval is 36% to 60%, again close to a chance level.
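
Both intervals can be reproduced, to the rounding used in the text, under the assumption of a normal approximation to the binomial. The Python sketch below is offered only as a check on the arithmetic; the method is our assumption and the function name percent_correct_ci is hypothetical.

```python
import math

def percent_correct_ci(correct, replies, z=1.96):
    """Normal-approximation interval, in percent, for the share of correct identifications."""
    p_hat = correct / replies
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / replies)
    return 100 * (p_hat - half_width), 100 * (p_hat + half_width)

# (label, correct identifications, replies received), as reported in the text
samples = [
    ("psychiatrists", 21, 41),
    ("computer scientists", 32, 67),
]

intervals = {}
for label, correct, replies in samples:
    low, high = percent_correct_ci(correct, replies)
    intervals[label] = (low, high)
    print(f"{label}: {low:.1f} to {high:.1f}, includes 50: {low <= 50 <= high}")
```

The first sample gives about 35.9 to 66.5 and the second about 35.8 to 59.7, which rounds to the reported 36 to 60. Both intervals contain the 50% chance level.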
73800		Thus the answer to this machine-question "can expert judges,
73900	psychiatrists and computer scientists, using teletyped transcripts
74000	of psychiatric interviews, distinguish between paranoid patients and
74100	a simulation of paranoid processes?" is "No".  But what do we learn
74200	from this?  It is some comfort that the answer was not "yes" and the
74300	null hypothesis (no differences) failed to be rejected, especially
74400	since statistical tests are somewhat biased in favor of rejecting the
74500	null hypothesis (Meehl,1967).  Yet this answer does not tell us what
74600	we would most like to know, i.e. how to improve the model.
74700	Simulation models do not spring forth in a complete, perfect and
74800	final form; they must be gradually developed over time.  Perhaps we
74900	might obtain a "yes" answer to the machine-question if we allowed a
75000	large number of expert judges to conduct the interviews themselves
75100	rather than studying transcripts of other interviewers.  Such an
75150	answer would
75200	indicate that the model must be improved, but unless we systematically
75300	investigated how the judges succeeded in making the discrimination we
75400	would not know what aspects of the model to work on. The logistics of
75500	such a design are immense and obtaining a large N of judges for sound
75600	statistical inference would require an effort disproportionate to the
75700	information-yield.
75800			MULTIDIMENSIONAL EVALUATION
75900		A more efficient and informative way to use Turing-like tests
76000	is to ask judges to make ordinal ratings along scaled dimensions from
76100	teletyped  interviews.     We  shall  term  this  approach asking the
76200	dimension-question.   One can then compare scaled ratings received by
76300	the patients and by the model to precisely determine where and by how
76400	much they differ.        Model builders  strive  for  a  model  which
76500	shows     indistinguishability     along    some    dimensions    and
76600	distinguishability along others. That is, the model converges on what
76700	it is supposed to simulate and diverges from that which it is not.
76800		We  mailed  paired-interview  transcripts  to   another   400
76900	randomly  selected psychiatrists asking them to rate the responses of
77000	the two `patients' along certain dimensions. The judges were  divided
77100	into  groups,  each  judge  being asked to rate responses of each I-O
77200	pair in the interviews along four dimensions.  The total number of
77300	dimensions in this test was twelve: linguistic noncomprehension,
77400	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
77500	ideas  of  reference, delusions, mistrust, depression, suspiciousness
77600	and mania. These are dimensions which psychiatrists commonly  use  in
77700	evaluating patients.
77710			(INSERT TABLE 4 HERE)
77800		Table 4 shows there were significant differences, with  PARRY
77900	receiving   higher   scores   along   the  dimensions  of  linguistic
78000	noncomprehension, thought disorder, bizarreness, anger, mistrust and
78100	suspiciousness.  On the dimension of delusions the patients were rated
78200	significantly higher.  There were no significant differences along the
78300	dimensions of organic brain syndrome, fear, ideas of reference,
78400	depression and mania.
78500		While   tests   asking    the    machine-question    indicate
78600	indistinguishability  at  the  gross  level,  a study of the finer or
78700	micro-structure of the model's behavior through ratings along  scaled
78800	dimensions   shows   statistically  significant  differences  between
78900	patients and model.     These differences are of help  to  the  model
79000	builder in suggesting which aspects of the model must be modified and
79100	improved in order to be considered  an  adequate  simulation  of  the
79200	class  of paranoid patients it is intended to simulate.  For example,
79300	it is clear that PARRY'S  language-comprehension  must  be  improved.
79400	Once  this  has  been implemented, a future test will tell us whether
79500	improvement has occurred and by how much in comparison to the earlier
79600	version.    Successive  identification of particular areas of failure
79700	in the model permits their improvement and the  development  of  more
79800	adequate model-versions.
79900		Further evidence that the machine-question is too coarse  and
80000	insensitive  a test comes from the following experiment. In this test
80100	we constructed a random version of the paranoid model which  utilized
80200	PARRY'S  output statements but expressed them randomly no matter what
80300	the interviewer said.   Two psychiatrists conducted  interviews  with
80400	this  model, transcripts of which were paired with patient interviews
80500	and sent to 200  randomly  selected  psychiatrists  asking  both  the
80600	machine-question  and  the dimension-question.  Of the 69 replies, 34
80700	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
80800	69  psychiatrists,  the 95% confidence interval ranges from 39 to 63,
80900	again indicating a chance level.  However, as shown in Table 5,
81000	significant  differences  appear  along  the dimensions of linguistic
81100	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
81200	rated  higher.  On  these  particular  dimensions  we can construct a
81300	continuum in which the random version  represents  one  extreme,  the
81400	actual patients another. Our (nonrandom) PARRY lies somewhere between
81500	these two extremes, indicating that it performs significantly  better
81600	than  the  random version but still requires improvement before being
81700	indistinguishable from patients (see Fig. 1).  Table 6 presents t
81800	values for differences between mean ratings of PARRY and
81900	RANDOM-PARRY.  (See Table 5 and Fig. 1 for the mean ratings.)
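
To make concrete what a t value for a difference between mean ratings involves, the Python sketch below computes one for a single dimension. It assumes an unequal-variance (Welch) form, since the text does not say which t test was used, and the ratings shown are invented for illustration; they are not data from Tables 5 or 6. The function name welch_t is hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """t statistic for the difference of two means, without assuming equal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical ratings on one dimension (e.g. linguistic noncomprehension).
parry_ratings = [3, 4, 2, 5, 4, 3]
random_parry_ratings = [6, 7, 5, 8, 6, 7]

t = welch_t(parry_ratings, random_parry_ratings)
print(round(t, 2))  # negative: the random version was rated higher on this dimension
```

A large negative value on a dimension such as this would place the random version further from the patients than PARRY, which is the ordering the continuum described above requires.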
82000		Thus it can be seen that  such a multidimensional evaluation
82100	provides  yardsticks  for measuring the adequacy of this or any other
82200	dialogue simulation model along the relevant dimensions.
82300		We conclude that when model builders want to conduct tests of
82400	adequacy which indicate in  which  direction  progress  lies  and  to
82500	obtain  a  measure  of whether progress is being achieved, the way to
82600	use Turing-like tests is to ask expert judges to make  ratings  along
82700	multiple   dimensions  that  are  essential  to  the  model.  A  good
82800	validation procedure has criteria for better or worse approximations.
82900	Useful tests do not prove a model; they probe it for its strengths
83000	and weaknesses and clarify what is to be done next in  modifying  and
83100	repairing the model. Simply asking the machine-question yields little
83200	information relevant to what the model builder most  wants  to  know,
83300	namely, along what dimensions must the model be improved.
83400	
83500